Scheduling Data Intensive Workflow Applications based on Multi-Source Parallel Data Retrieval in Distributed Computing Networks
Abstract
Many large-scale scientific experiments are carried out in collaboration with researchers and laboratories located around the world so that they can leverage the expertise and high-tech infrastructure present at those locations and collectively perform experiments more quickly. Data produced by these experiments are thus replicated and cached at multiple geographic locations. This necessitates new techniques for selecting both data and compute resources so that application executions are time and cost efficient when using distributed resources. Existing heuristic-based techniques select the ‘best’ data source for retrieving data to a compute resource and then carry out task-resource assignment. However, this scheduling approach, which is based only on single-source data retrieval, may not give time (and cost) efficient schedules when: 1) tasks are interdependent on data (workflow), 2) the average size of data processed by every task is large, and 3) data transfer time exceeds task computation time by at least an order of magnitude. To achieve time-efficient schedules, we leverage the presence of replicated data sources to retrieve data in parallel from multiple sources, and we incorporate this technique into our scheduling heuristic. In this paper, we propose a multi-source data retrieval based scheduling heuristic that assigns interdependent tasks to compute resources based on both multi-source parallel data retrieval time and task computation time. We carried out scheduling experiments by modeling applications from the life sciences and astronomy domains and deploying them on both emulated and real testbed environments. With a combination of data retrieval and task-resource mapping techniques, we showed that our heuristic can achieve time-efficient schedules that are better than those of existing heuristic-based techniques for scheduling application workflows.
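The idea of ranking compute resources by retrieval time plus computation time can be illustrated with a minimal sketch. This is not the heuristic from the paper, whose details the abstract does not give. It assumes an idealised model in which a file is split across replicas in proportion to link bandwidth, so all partial transfers finish together and the retrieval time is the file size divided by the summed bandwidth. The names `Resource`, `speed`, `bandwidth_from`, and `pick_resource` are hypothetical, and task dependencies and network contention are ignored.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """Hypothetical compute resource (all fields are illustrative assumptions)."""
    name: str
    speed: float          # work units processed per second
    bandwidth_from: dict  # replica site name -> MB/s towards this resource


def single_source_time(size_mb, bandwidths):
    """Transfer time when all data comes from the fastest replica only."""
    return size_mb / max(bandwidths)


def multi_source_time(size_mb, bandwidths):
    """Transfer time when the file is split across replicas in proportion to
    bandwidth, so every partial transfer finishes at the same moment."""
    return size_mb / sum(bandwidths)


def split_sizes(size_mb, bandwidths):
    """Portion of the file (in MB) requested from each replica."""
    total = sum(bandwidths)
    return [size_mb * b / total for b in bandwidths]


def pick_resource(size_mb, work, replicas, resources):
    """Return (name, estimated time) of the resource that minimises
    multi-source retrieval time plus computation time for one task."""
    best = None
    for r in resources:
        bws = [r.bandwidth_from[s] for s in replicas if s in r.bandwidth_from]
        if not bws:
            continue  # this resource cannot reach any replica
        t = multi_source_time(size_mb, bws) + work / r.speed
        if best is None or t < best[1]:
            best = (r.name, t)
    return best


resources = [
    Resource("r1", speed=100.0, bandwidth_from={"A": 10.0, "B": 30.0}),
    Resource("r2", speed=200.0, bandwidth_from={"A": 20.0, "B": 5.0}),
]
choice = pick_resource(size_mb=1000.0, work=2000.0,
                       replicas=["A", "B"], resources=resources)
print(choice)
```

In this toy setting the slower resource `r1` is chosen because its combined bandwidth to the two replicas outweighs the faster processor of `r2`. A real workflow scheduler would also have to account for task dependencies and for links shared between concurrent transfers.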
Similar resources
Scheduling Workflow Applications Based on Multi-source Parallel Data Retrieval in Distributed Computing Networks
Many scientific experiments are carried out in collaboration with researchers around the world to use existing infrastructures and conduct experiments at massive scale. Data produced by such experiments are thus replicated and cached at multiple geographic locations. This gives rise to new challenges when selecting distributed data and compute resources so that the execution of applications is ...
Data Replication-Based Scheduling in Cloud Computing Environment
High-performance computing and vast storage are two key factors required for executing data-intensive applications. In comparison with traditional distributed systems like data grid, cloud computing provides these factors in a more affordable, scalable and elastic platform. Furthermore, accessing data files is critical for performing such applications. Sometimes accessing data becomes...
Scheduling and management of data intensive application workflows in grid and cloud computing environments
Large-scale scientific experiments are being conducted in collaboration with teams that are dispersed globally. Each team shares its data and utilizes distributed resources for conducting experiments. As a result, scientific data are replicated and cached at distributed locations around the world. These data are part of application workflows, which are designed for reducing the complexity of ex...
A Clustering Approach to Scientific Workflow Scheduling on the Cloud with Deadline and Cost Constraints
One of the main features of High Throughput Computing systems is the availability of high power processing resources. Cloud Computing systems can offer these features through concepts like Pay-Per-Use and Quality of Service (QoS) over the Internet. Many applications in Cloud computing are represented by workflows. Quality of Service is one of the most important challenges in the context of sche...
Green Energy-aware task scheduling using the DVFS technique in Cloud Computing
Nowadays, energy consumption has become a critical issue in high-performance distributed computing systems, so green computing tries to reduce energy consumption, carbon footprint and CO2 emissions in high performance computing systems (HPCs) such as clusters, Grid and Cloud that a large number of parallel. Reducing energy consumption for high end computing can bring various benefits such as red...